Welcome to CS148 - Data Science Fundamentals! Since we'll be moving through topics aggressively in this course, we'll start with an end-to-end walkthrough of a data science project, and then ask you to replicate the code yourself for a new dataset.
Please note: we don't expect you to fully grasp everything happening here, in either code or theory. This content will be reviewed throughout the quarter. Rather, we hope that seeing the full arc of a data science project will help contextualize the pieces as they're covered in class.
In that spirit, we will first work through an example project from end to end to give you a feel for the steps involved.
Here are the main steps:
It is best to experiment with real data as opposed to artificial datasets.
There are many different open datasets depending on the type of problems you might be interested in!
Here are a few data repositories you could check out:
Below we will run through a California Housing example, based on data collected in the 1990s.
We'll start by importing a series of libraries we'll be using throughout the project.
import sys
assert sys.version_info >= (3, 5) # python>=3.5
import sklearn
#assert sklearn.__version__ >= "0.20" # sklearn >= 0.20
import numpy as np # numerical package in python
import matplotlib # plotting library
import matplotlib.pyplot as plt # plotting package
# matplotlib magic for inline figures
%matplotlib inline
# to make this notebook's output identical at every run
np.random.seed(42)
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook+pdf"
In this section we will load the dataset, and visualize different features using different types of plots.
Packages we will use:
Note: If you're working in Colab for this project, the CSV file first has to be loaded into the environment. This can be done manually using the sidebar menu, or with the following code.
If you're running this notebook locally on your device, simply proceed to the next step.
#from google.colab import files
#files.upload()
We'll now begin working with Pandas, the principal library for data management in Python. Its primary data structure is the dataframe: a two-dimensional table where each column represents a feature (with a single datatype) and each row a specific record in the set.
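As a minimal sketch of that structure (using a couple of made-up rows rather than the actual housing file):

```python
import pandas as pd

# each dict key becomes a column; each row is one record
toy = pd.DataFrame({
    "median_income": [8.3252, 7.2574, 3.8462],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],
})
print(toy.shape)   # (3, 2): 3 rows, 2 columns
print(toy.dtypes)  # one dtype per column
```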
To work with dataframes, we have to first read in the csv file and convert it to a dataframe using the code below.
# We'll now import the holy grail of Python data science: Pandas!
import pandas as pd
housing = pd.read_csv('datasets/housing/housing.csv')
housing.head() # show the first few rows of the dataframe
# typically this is the first thing you do
# to see what the dataframe looks like
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
A dataset may have different types of features: real-valued (floats), integer-valued, or categorical (strings/characters).
The two categorical flavors (string and integer) are essentially the same, as you can always map a categorical string/character to an integer.
In our example dataset, all features are real-valued floats except ocean_proximity, which is categorical.
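To see that string-to-integer mapping in action, here is a toy sketch using `pd.factorize` (illustrative only, not part of the project pipeline):

```python
import pandas as pd

s = pd.Series(["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"])
# factorize assigns an integer code per distinct category,
# in order of first appearance
codes, uniques = pd.factorize(s)
print(list(codes))    # [0, 1, 0, 2]
print(list(uniques))  # ['NEAR BAY', 'INLAND', 'ISLAND']
```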
# to see a concise summary of data types, null values, and counts
# use the info() method on the dataframe
housing.info()
&lt;class 'pandas.core.frame.DataFrame'&gt;
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
# you can access individual columns similarly
# to accessing elements in a python dict
housing["ocean_proximity"].head() # added head() to avoid printing many rows..
0    NEAR BAY
1    NEAR BAY
2    NEAR BAY
3    NEAR BAY
4    NEAR BAY
Name: ocean_proximity, dtype: object
# to access a particular row we can use iloc
housing.iloc[1]
longitude              -122.22
latitude                 37.86
housing_median_age        21.0
total_rooms             7099.0
total_bedrooms          1106.0
population              2401.0
households              1138.0
median_income           8.3014
median_house_value    358500.0
ocean_proximity       NEAR BAY
Name: 1, dtype: object
# one other function that might be useful is
# value_counts(), which counts the number of occurrences
# for categorical features
housing["ocean_proximity"].value_counts()
&lt;1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
# The describe function compiles your typical statistics for each
# column
housing.describe()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
| 25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
| 50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
| 75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
| max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
# We can draw a histogram for each of the dataframes features
# using the hist function
housing.hist(bins=50, figsize=(20,15))
# save_fig("attribute_histogram_plots")
plt.show() # pandas internally uses matplotlib, and to display all the figures
# the show() function must be called
# if you want to have a histogram on an individual feature:
housing["median_income"].hist()
plt.show()
We can convert a floating point feature to a categorical feature by binning or by defining a set of intervals.
For example, to bin the households based on median_income we can use the pd.cut function
# assign each bin a categorical value [1, 2, 3, 4, 5] in this case.
housing["income_cat"] = pd.cut(housing["median_income"],
bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
labels=[1, 2, 3, 4, 5])
housing["income_cat"].value_counts()
3    7236
2    6581
4    3639
5    2362
1     822
Name: income_cat, dtype: int64
housing["income_cat"].hist()
# here's a not-so-interesting way of plotting it
housing.plot(kind="scatter", x="longitude", y="latitude")
# we can make it look a bit nicer by using the alpha parameter,
# it simply plots less dense areas lighter.
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
# A more interesting plot is to color code (heatmap) the dots
# based on income. The code below achieves this
# Please note: In order for this to work, ensure that you've loaded an image
# of california (california.png) into this directory prior to running this
import matplotlib.image as mpimg
california_img=mpimg.imread('images/california.png')
ax = housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
s=housing['population']/100, label="Population",
c="median_house_value", cmap=plt.get_cmap("jet"),
colorbar=False, alpha=0.4,
)
# overlay the california map on the scatter plot
# note: plt.imshow draws onto the current (most recently created) figure
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,
cmap=plt.get_cmap("jet"))
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)
# setting up heatmap colors based on median_house_value feature
prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
cb = plt.colorbar()
cb.ax.set_yticklabels(["$%dk"%(round(v/1000)) for v in tick_values], fontsize=14)
cb.set_label('Median House Value', fontsize=16)
plt.legend(fontsize=16)
plt.show()
Not surprisingly, the most expensive houses are concentrated around the San Francisco and Los Angeles areas.
Up until now we have only visualized feature histograms and basic statistics.
When developing machine learning models, what matters is how predictive a feature is for the particular target of interest.
It may be that only a few features are useful for the target at hand, or features may need to be augmented by applying certain transformations.
Nonetheless, we can explore this using correlation matrices.
corr_matrix = housing.corr(numeric_only=True) # only consider numeric columns
# for example if the target is "median_house_value", most correlated features can be sorted
# which happens to be "median_income". This also intuitively makes sense.
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64
# the correlation matrix for different attributes/features can also be plotted
# some features may show a positive or negative correlation, or
# may turn out to be completely uncorrelated!
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
"housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
# median income vs. median house value (plot 2 in the first row of the figure above)
housing.plot(kind="scatter", x="median_income", y="median_house_value",
alpha=0.1)
plt.axis([0, 16, 0, 550000])
# obtain new correlations
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64
New features can be created by combining different columns from our data set.
housing["rooms_per_household"] = housing["total_rooms"]/(housing["households"]+1)
housing["bedrooms_per_room"] = housing["total_bedrooms"]/(housing["total_rooms"]+1)
housing["population_per_household"]=housing["population"]/(housing["households"]+1)
housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value",
alpha=0.2)
plt.axis([0, 5, 0, 520000])
plt.show()
# have you noticed, when looking at the dataframe summary, that certain rows
# contained null values? we can't just leave them as nulls and expect our
# model to handle them for us...
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
sample_incomplete_rows
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | income_cat | rooms_per_household | bedrooms_per_room | population_per_household |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 290 | -122.16 | 37.77 | 47.0 | 1256.0 | NaN | 570.0 | 218.0 | 4.3750 | 161900.0 | NEAR BAY | 3 | 5.735160 | NaN | 2.602740 |
| 341 | -122.17 | 37.75 | 38.0 | 992.0 | NaN | 732.0 | 259.0 | 1.6196 | 85100.0 | NEAR BAY | 2 | 3.815385 | NaN | 2.815385 |
| 538 | -122.28 | 37.78 | 29.0 | 5154.0 | NaN | 3741.0 | 1273.0 | 2.5762 | 173400.0 | NEAR BAY | 2 | 4.045526 | NaN | 2.936421 |
| 563 | -122.24 | 37.75 | 45.0 | 891.0 | NaN | 384.0 | 146.0 | 4.9489 | 247100.0 | NEAR BAY | 4 | 6.061224 | NaN | 2.612245 |
| 696 | -122.10 | 37.69 | 41.0 | 746.0 | NaN | 387.0 | 161.0 | 3.9063 | 178400.0 | NEAR BAY | 3 | 4.604938 | NaN | 2.388889 |
sample_incomplete_rows.dropna(subset=["total_bedrooms"]) # option 1: simply drop rows that have null values
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | income_cat | rooms_per_household | bedrooms_per_room | population_per_household |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

(the result is empty: every sampled row had a null total_bedrooms, so dropping rows with nulls removes them all)
sample_incomplete_rows.drop("total_bedrooms", axis=1) # option 2: drop the complete feature
| | longitude | latitude | housing_median_age | total_rooms | population | households | median_income | median_house_value | ocean_proximity | income_cat | rooms_per_household | bedrooms_per_room | population_per_household |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 290 | -122.16 | 37.77 | 47.0 | 1256.0 | 570.0 | 218.0 | 4.3750 | 161900.0 | NEAR BAY | 3 | 5.735160 | NaN | 2.602740 |
| 341 | -122.17 | 37.75 | 38.0 | 992.0 | 732.0 | 259.0 | 1.6196 | 85100.0 | NEAR BAY | 2 | 3.815385 | NaN | 2.815385 |
| 538 | -122.28 | 37.78 | 29.0 | 5154.0 | 3741.0 | 1273.0 | 2.5762 | 173400.0 | NEAR BAY | 2 | 4.045526 | NaN | 2.936421 |
| 563 | -122.24 | 37.75 | 45.0 | 891.0 | 384.0 | 146.0 | 4.9489 | 247100.0 | NEAR BAY | 4 | 6.061224 | NaN | 2.612245 |
| 696 | -122.10 | 37.69 | 41.0 | 746.0 | 387.0 | 161.0 | 3.9063 | 178400.0 | NEAR BAY | 3 | 4.604938 | NaN | 2.388889 |
median = housing["total_bedrooms"].median()
# option 3: replace na values with median values
# (plain assignment avoids the SettingWithCopy warning that
# inplace fillna on a sliced dataframe can trigger)
sample_incomplete_rows["total_bedrooms"] = sample_incomplete_rows["total_bedrooms"].fillna(median)
bpr_median = housing["bedrooms_per_room"].median()
sample_incomplete_rows["bedrooms_per_room"] = sample_incomplete_rows["bedrooms_per_room"].fillna(bpr_median)
sample_incomplete_rows
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | income_cat | rooms_per_household | bedrooms_per_room | population_per_household |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 290 | -122.16 | 37.77 | 47.0 | 1256.0 | 435.0 | 570.0 | 218.0 | 4.3750 | 161900.0 | NEAR BAY | 3 | 5.735160 | 0.203041 | 2.602740 |
| 341 | -122.17 | 37.75 | 38.0 | 992.0 | 435.0 | 732.0 | 259.0 | 1.6196 | 85100.0 | NEAR BAY | 2 | 3.815385 | 0.203041 | 2.815385 |
| 538 | -122.28 | 37.78 | 29.0 | 5154.0 | 435.0 | 3741.0 | 1273.0 | 2.5762 | 173400.0 | NEAR BAY | 2 | 4.045526 | 0.203041 | 2.936421 |
| 563 | -122.24 | 37.75 | 45.0 | 891.0 | 435.0 | 384.0 | 146.0 | 4.9489 | 247100.0 | NEAR BAY | 4 | 6.061224 | 0.203041 | 2.612245 |
| 696 | -122.10 | 37.69 | 41.0 | 746.0 | 435.0 | 387.0 | 161.0 | 3.9063 | 178400.0 | NEAR BAY | 3 | 4.604938 | 0.203041 | 2.388889 |
Now that we've played around with this, let's finalize the approach by replacing the nulls in our final dataset.
housing["total_bedrooms"].fillna(median, inplace=True)
bpr_median = housing["bedrooms_per_room"].median()
housing["bedrooms_per_room"].fillna(bpr_median, inplace=True) # option 3: replace na values with median values
Could you think of another plausible imputation for this dataset?
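One plausible alternative, sketched here on a tiny made-up frame, is a group-wise median: fill each missing total_bedrooms with the median of its own ocean_proximity group rather than the global median:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "NEAR BAY", "NEAR BAY"],
    "total_bedrooms":  [100.0, np.nan, 300.0, np.nan],
})
# fill each missing value with the median of its own group
df["total_bedrooms"] = (
    df.groupby("ocean_proximity")["total_bedrooms"]
      .transform(lambda g: g.fillna(g.median()))
)
print(df["total_bedrooms"].tolist())  # [100.0, 100.0, 300.0, 300.0]
```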
So we're almost ready to feed our dataset into a machine learning model, but we're not quite there yet!
Generally speaking, models can only work with numeric data, which means that if you have categorical data you want included in your model, you'll need to convert it to numbers. We'll explore this more later, but for now we'll take one approach to converting our ocean_proximity field into a numeric one.
from sklearn.preprocessing import LabelEncoder
# creating instance of labelencoder
labelencoder = LabelEncoder()
# Assigning numerical values, overwriting the original column
housing['ocean_proximity'] = labelencoder.fit_transform(housing['ocean_proximity'])
housing.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity | income_cat | rooms_per_household | bedrooms_per_room | population_per_household |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | 3 | 5 | 6.929134 | 0.146425 | 2.535433 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | 3 | 5 | 6.232660 | 0.155775 | 2.107989 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | 3 | 5 | 8.241573 | 0.129428 | 2.786517 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | 3 | 4 | 5.790909 | 0.184314 | 2.536364 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | 3 | 3 | 6.257692 | 0.171990 | 2.173077 |
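Note that label encoding imposes an arbitrary ordering on the categories (e.g. it implies ISLAND &lt; NEAR BAY). A common alternative, not used in this walkthrough, is one-hot encoding, sketched here on toy data with pandas' get_dummies:

```python
import pandas as pd

toy = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "ISLAND"]})
# one indicator column per category, no implied ordering
onehot = pd.get_dummies(toy, columns=["ocean_proximity"])
print(list(onehot.columns))
```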
After having cleaned your dataset you're ready to train your machine learning model.
To do so you'll aim to divide your data into:
In some cases you might also have a validation set for tuning hyperparameters (don't worry if you're not familiar with this term yet..)
In a supervised learning setting, your train and test sets should contain (feature, target) tuples.
We will make use of the scikit-learn Python package for preprocessing.
Scikit-learn is well documented; if you get confused at any point, simply look up the function/object!
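For reference, the simplest (non-stratified) split can be sketched with scikit-learn's train_test_split on toy arrays; in the actual project we use StratifiedShuffleSplit instead, so that the income_cat distribution is preserved across the splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10).reshape(-1, 1)  # toy features
y = np.arange(10)                 # toy targets
# hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 8 2
```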
from sklearn.model_selection import StratifiedShuffleSplit
# let's first start by creating our train and test sets
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
train_set = housing.loc[train_index]
test_set = housing.loc[test_index]
housing_training = train_set.drop("median_house_value", axis=1) # drop labels for training set features
# the input to the model should not contain the true label
housing_labels = train_set["median_house_value"].copy()
housing_testing = test_set.drop("median_house_value", axis=1) # drop labels for the test set features
# the input to the model should not contain the true label
housing__test_labels = test_set["median_house_value"].copy()
Once we have prepared the dataset it's time to choose a model.
As our task is to predict the median_house_value (a floating value), regression is well suited for this.
from sklearn.linear_model import LinearRegression
print(housing_training.isna().any())
lin_reg = LinearRegression()
lin_reg.fit(housing_training, housing_labels)
longitude                   False
latitude                    False
housing_median_age          False
total_rooms                 False
total_bedrooms              False
population                  False
households                  False
median_income               False
ocean_proximity             False
income_cat                  False
rooms_per_household         False
bedrooms_per_room           False
population_per_household    False
dtype: bool
LinearRegression()
# let's try our model on a few testing instances
data = housing_testing.iloc[:5]
labels = housing__test_labels.iloc[:5]
print("Predictions:", lin_reg.predict(data))
print("Actual labels:", list(labels))
Predictions: [423994.82351803 299106.45227567 227754.6073794  185058.44480231 244945.84152691]
Actual labels: [500001.0, 162500.0, 204600.0, 159700.0, 184000.0]
We can evaluate our model using certain metrics; a fitting metric for regression is the mean squared error
$$L(\hat{Y}, Y) = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$where $\hat{y}_i$ is the predicted value and $y_i$ is the ground-truth label.
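As a quick sanity check on this metric with made-up numbers, the manual computation matches scikit-learn's mean_squared_error:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])
# mean of the squared residuals
manual = np.mean((y_pred - y_true) ** 2)
assert np.isclose(manual, mean_squared_error(y_true, y_pred))
print(manual)  # (0.25 + 0.0 + 4.0) / 3 ≈ 1.4167
```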
from sklearn.metrics import mean_squared_error
preds = lin_reg.predict(housing_testing)
mse = mean_squared_error(housing__test_labels, preds)
rmse = np.sqrt(mse)
rmse
67427.31192602114
Is this a good result? What do you think an acceptable error rate is for this sort of problem?
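One way to judge such a number is against a trivial baseline that always predicts the mean of the targets; a useful model should beat it by a wide margin. The sketch below uses synthetic data (not the housing splits) with scikit-learn's DummyRegressor:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# synthetic regression data with a known linear relationship
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

baseline = DummyRegressor(strategy="mean").fit(X, y)  # always predicts mean(y)
model = LinearRegression().fit(X, y)

rmse_base = np.sqrt(mean_squared_error(y, baseline.predict(X)))
rmse_model = np.sqrt(mean_squared_error(y, model.predict(X)))
print(rmse_base, rmse_model)  # the fitted model should be far more accurate
```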
Ok, now it's time to get to work! We will apply what we've learnt to another dataset: the Airbnb NYC dataset. For this project we will attempt to predict the Airbnb rental price based on the other features in the dataset.
Let's do the following set of tasks to get us warmed up:
import pandas as pd
airbnb = pd.read_csv('datasets/airbnb/AB_NYC_2019.csv')
airbnb.drop(["name", "host_id", "host_name", "last_review", "neighbourhood"], axis=1, inplace=True)
airbnb.info()
&lt;class 'pandas.core.frame.DataFrame'&gt;
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 11 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   id                              48895 non-null  int64
 1   neighbourhood_group             48895 non-null  object
 2   latitude                        48895 non-null  float64
 3   longitude                       48895 non-null  float64
 4   room_type                       48895 non-null  object
 5   price                           48895 non-null  int64
 6   minimum_nights                  48895 non-null  int64
 7   number_of_reviews               48895 non-null  int64
 8   reviews_per_month               38843 non-null  float64
 9   calculated_host_listings_count  48895 non-null  int64
 10  availability_365                48895 non-null  int64
dtypes: float64(3), int64(6), object(2)
memory usage: 4.1+ MB
Let's try another popular python graphics library: Plotly.
You can find documentation and all the examples you'll need here: Plotly Documentation
Let's start out by getting a better feel for the distribution of rentals in the market.
import plotly.express as px
fig = px.pie(airbnb, values='calculated_host_listings_count', names='neighbourhood_group', title="Rental Units per Neighbourhood Group")
fig.show()
We now want to see the total number of reviews left for each neighbourhood group in the form of a bar chart (where the X-axis is the neighbourhood group and the Y-axis is a count of reviews).
This is a two step process:
reviews = airbnb.groupby("neighbourhood_group")["number_of_reviews"].sum().reset_index()
fig = px.bar(reviews, x="neighbourhood_group", y="number_of_reviews", title="Total Number of Reviews per Neighborhood Group")
fig.show()
For reference you can use the Matplotlib code above to replicate this graph here.
import matplotlib.image as mpimg
import matplotlib as mpl
nyc_img=mpimg.imread('images/nyc.png')
colors = {"Entire home/apt": "blue", "Private room": "green", "Shared room": "red"}
mp_color = [colors[t] for t in airbnb['room_type']]
#labelencoder = LabelEncoder()
# Assigning numerical values and storing in another column
#airbnb['room_type'] = labelencoder.fit_transform(airbnb['room_type'])
ax = airbnb.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
s=airbnb["price"]/10, label="Price",
c=mp_color, # explicit colors per point; a cmap would be ignored here
colorbar=False, alpha=0.4, title="Airbnb Locations in New York City"
)
# overlay the NYC map on the scatter plot
# note: plt.imshow draws onto the current (most recently created) figure
plt.imshow(nyc_img, extent=[-74.252, -73.71, 40.49, 40.92], alpha=0.5,
cmap=plt.get_cmap("jet"))
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)
norm = mpl.colors.Normalize(vmin=0, vmax=2)
# creating ScalarMappable
cmap = plt.get_cmap('jet', 3)
sm = plt.cm.ScalarMappable(cmap=cmap, norm=norm)
sm.set_array([])
# set up a legend colorbar mapping colors to room types
cb = plt.colorbar(sm, ticks=[0, 1, 2], ax=ax) # pass ax explicitly to avoid the colorbar deprecation warning
cb.ax.set_yticklabels(["Entire home/apt", "Private room", "Shared room"], fontsize=14)
#cb.set_label('Room Type', fontsize=16)
# NYC bounding box: lat 40.49979 to 40.91306, lon -74.24442 to -73.71299
plt.legend(fontsize=16)
plt.show()
Now try to recreate this plot using Plotly's scatterplot functionality. Note that the increased interactivity of the plot allows for some very cool functionality.
import plotly.graph_objects as go
from PIL import Image
fig = go.Figure()
img = Image.open("images/nyc.png")
fig.add_trace(go.Scatter(x=airbnb["longitude"], y=airbnb["latitude"], mode="markers"))
fig.add_layout_image(
dict(
source=img,
xref="x",
yref="y",
x=-74.26,
y=40.93,
sizex=.56,
sizey=.45,
sizing="stretch",
opacity=0.5,
layer="below")
)
fig.update_layout(
title={
'text': "Airbnb Locations in New York City",
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'})
fig.show()
As with the previous example, you'll have to do a little bit of data engineering before you actually generate the plot.
Generally I'd recommend the following series of steps:
means = airbnb[(airbnb['number_of_reviews'] >= 10) & (airbnb['neighbourhood_group'] == 'Brooklyn')].groupby('room_type')["price"].mean().reset_index()
px.scatter(means, x='room_type', y='price', title='Average Price of Room Types in Brooklyn with >= 10 Reviews')
Let's create a new binned feature, price_cat, that divides our dataset into five price levels (1-5); you can choose the bin edges to assign.
Do a value count to check the distribution of values
bins = [-np.inf,50,100,250,500,np.inf]
labels=[1,2,3,4,5]
airbnb["price_cat"] = pd.cut(airbnb["price"], bins=bins, labels=labels)
print(airbnb["price_cat"].value_counts())
3    19759
2    17367
1     6561
4     4164
5     1044
Name: price_cat, dtype: int64
Now engineer at least one new feature.
bins = [-np.inf,75,150,225,300,np.inf]
labels=[1,2,3,4,5]
airbnb['availability_cat'] = pd.cut(airbnb['availability_365'], bins=bins, labels=labels)
print(airbnb['availability_cat'].value_counts())
1    27136
5     8108
2     4946
3     4542
4     4163
Name: availability_cat, dtype: int64
Determine if there are any null values, and if there are, impute them.
sample_incomplete_rows = airbnb[airbnb.isnull().any(axis=1)]
median_rpm = airbnb["reviews_per_month"].median()
airbnb["reviews_per_month"].fillna(median_rpm, inplace=True)
airbnb.isna().any() # all features have no null values
id                                False
neighbourhood_group               False
latitude                          False
longitude                         False
room_type                         False
price                             False
minimum_nights                    False
number_of_reviews                 False
reviews_per_month                 False
calculated_host_listings_count    False
availability_365                  False
price_cat                         False
availability_cat                  False
dtype: bool
Finally, review what features in your dataset are non-numeric and convert them.
labelencoder = LabelEncoder()
airbnb['neighbourhood_group'] = labelencoder.fit_transform(airbnb['neighbourhood_group'])
airbnb['room_type'] = labelencoder.fit_transform(airbnb['room_type'])
print(airbnb.select_dtypes(exclude=np.number).any()) # show no non-numeric features
Series([], dtype: bool)
Using our StratifiedShuffleSplit example from above, let's split our data into an 80/20 training/testing split, using neighbourhood_group to stratify the dataset.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(airbnb, airbnb["neighbourhood_group"]):
train_set = airbnb.loc[train_index]
test_set = airbnb.loc[test_index]
Finally, remove your label (price) from your training and testing cohorts, and create separate label series.
airbnb_training = train_set.drop("price", axis=1) # drop labels for training set features
# the input to the model should not contain the true label
airbnb_labels = train_set["price"].copy()
airbnb_testing = test_set.drop("price", axis=1) # drop labels for the test set features
# the input to the model should not contain the true label
airbnb__test_labels = test_set["price"].copy()
The task is to predict the price; refer to the housing example for how to train and evaluate your model using MSE. Provide both test and train set MSE values.
lin_reg2 = LinearRegression()
lin_reg2.fit(airbnb_training, airbnb_labels)
LinearRegression()
test_preds = lin_reg2.predict(airbnb_testing)
test_mse = mean_squared_error(airbnb__test_labels, test_preds)
train_preds = lin_reg2.predict(airbnb_training)
train_mse = mean_squared_error(airbnb_labels, train_preds)
print("Test MSE:", test_mse)
print("Train MSE:", train_mse)
Test MSE: 41528.768923969954
Train MSE: 40983.52191032838
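Since MSE is in squared price units, taking the square root (as we did for the housing data) gives an error in the original dollar units, which is easier to interpret:

```python
import numpy as np

test_mse = 41528.768923969954  # test MSE printed above
rmse = np.sqrt(test_mse)
print(rmse)  # ≈ 203.8, i.e. the model is off by roughly $204 on average
```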